A Gentle Introduction to Variable Selection Using a Bayesian Approach
This example is based on section 18.4 in Kruschke, 2015.
A Bayesian approach to variable selection in regression analysis can be implemented by adding a binary inclusion parameter for each slope:
\[y_i = \beta_0 + \sum_j \delta_j \, \beta_j x_{i,j},\]
with \(\delta_j \in \{0, 1\}\).
Every combination of values \(\delta_1, \ldots, \delta_k\) defines a submodel. A simple prior for the indicators is a set of independent Bernoulli distributions, \(\delta_j \sim \text{dbern}(0.5)\). The posterior probabilities \(P(\delta_j = 1 \mid D)\) then indicate the importance of the corresponding predictors \(x_j\).
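With \(k\) candidate predictors there are \(2^k\) possible submodels; the four predictors used below give 16 combinations. A quick illustrative sketch in R (not part of the original analysis; deltaGrid is a hypothetical name):
# enumerate all 2^4 = 16 delta configurations for four predictors (illustrative sketch)
deltaGrid <- expand.grid(delta1 = 0:1, delta2 = 0:1, delta3 = 0:1, delta4 = 0:1)
nrow(deltaGrid)  # 16 submodels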
The diagram of such a model is a straightforward generalization of the regression model.
Since the parameters \(\delta_j\) are discrete, we need to use JAGS (via the runjags package) rather than Stan, which cannot sample discrete parameters directly.
library(runjags)
The SAT data includes:

- State: 50 states
- SATV: SAT verbal score
- SATM: SAT math score
- SATT: SAT total score
- Spend: average spending per pupil
- PrcntTake: percentage of students who took the test
- StuTeaRat: the average student-teacher ratio in each state
- Salary: the average salary of the teachers

These latter four variables are also plausible predictors of SAT score.
dta <- read.csv("data/Guber1999data.csv")
names(dta)
[1] "State"     "Spend"     "StuTeaRat" "Salary"    "PrcntTake"
[6] "SATV"      "SATM"      "SATT"
DT::datatable(dta)
Let us now look at the correlation matrix among the four predictors.
dta %>%
  select(Spend, PrcntTake, StuTeaRat, Salary) %>%
  cor(.) %>%
  round(., 3)
Spend PrcntTake StuTeaRat Salary
Spend 1.000 0.593 -0.371 0.870
PrcntTake 0.593 1.000 -0.213 0.617
StuTeaRat -0.371 -0.213 1.000 -0.001
Salary 0.870 0.617 -0.001 1.000
Note that Salary is strongly correlated with Spend (r = 0.870). A model that includes both Salary and Spend will therefore show a strong trade-off between these two predictors and, consequently, inflated uncertainty in the regression coefficient of each. Should only one or the other be included, or both, or neither?
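A quick classical check of this trade-off (an illustrative sketch, not part of the original analysis; fitSpendOnly and fitSpendSalary are hypothetical names): compare the standard error of the Spend coefficient with and without Salary in an ordinary least-squares fit.
# illustrative sketch: the Spend standard error grows when the correlated Salary is added
fitSpendOnly   <- lm(SATT ~ Spend, data = dta)
fitSpendSalary <- lm(SATT ~ Spend + Salary, data = dta)
summary(fitSpendOnly)$coefficients["Spend", "Std. Error"]
summary(fitSpendSalary)$coefficients["Spend", "Std. Error"]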
One more point: the response variable is the average high school SAT score per state, and one of the predictors is the amount of money the state spends on schools. Plotting spending against SAT score shows a negative slope.
plot(dta$Spend,dta$SATT)
Should we then cut funding for schools? Taken at face value, the plot suggests that the more a state spends, the lower its SAT score! Of course, this naive reading ignores the other predictors; in particular, Spend is correlated with PrcntTake (r = 0.593), the percentage of students who take the test.
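The apparent paradox can be probed with a quick sketch (not part of the original analysis): compare the Spend coefficient before and after adjusting for PrcntTake.
# illustrative sketch: adjust the Spend effect for the percentage of students taking the test
coef(lm(SATT ~ Spend, data = dta))["Spend"]              # slope ignoring other predictors
coef(lm(SATT ~ Spend + PrcntTake, data = dta))["Spend"]  # slope adjusted for PrcntTake
With that caveat in mind, we now prepare the data for JAGS.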
y <- dta[,'SATT']
x <- as.matrix(dta[ , c("Spend","PrcntTake","StuTeaRat","Salary")])

# prepare the data list for JAGS
dataList <- list(Ntotal = length(y),
                 y = y,
                 x = x,
                 Nx = ncol(x))
Description of the model:
\[
\begin{aligned}
y_i &\sim t(\mu_i, \sigma, \nu), \quad \text{with mean, scale, and degrees of freedom given by} \\
\mu_i &= \beta_0 + \sum_j \delta_j \, \beta_j \, x_{i,j} \\
\beta_0 &\sim N(0, 2^2) \ ; \quad \beta_j \sim t(0, \sigma_{\beta}, \nu = 1) \\
\sigma &\sim \text{Unif}(L, H) \\
\nu &= \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim \text{Exp}(\lambda) \\
\delta_j &\sim \text{Bernoulli}(0.5) \ ; \quad \sigma_{\beta} \sim \text{Gamma}(\text{shape}, \text{rate})
\end{aligned}
\]
(all on the standardized scale, as in the JAGS code below).
= "
modelString # Standardize the data:
data {
ym <- mean(y)
ysd <- sd(y)
for (i in 1:Ntotal) {
zy[i] <- (y[i] - ym) / ysd
}
for (j in 1:Nx) {
xm[j] <- mean(x[,j])
xsd[j] <- sd(x[,j])
for (i in 1:Ntotal) {
zx[i,j] <- (x[i,j] - xm[j]) / xsd[j]
}
}
}
# Specify the model for standardized data:
model {
for (i in 1:Ntotal) {
zy[i] ~ dt(zbeta0 + sum(delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu)
}
# Priors vague on standardized scale:
zbeta0 ~ dnorm(0, 1/2^2)
for (j in 1:Nx) {
zbeta[j] ~ dt(0 , 1/sigmaBeta^2 , 1)
delta[j] ~ dbern( 0.5 )
}
zsigma ~ dunif(1.0E-5 , 1.0E+1)
# sigmaBeta <- 2.0 ## uncomment one of the following specifications for sigmaBeta
# sigmaBeta ~ dunif( 1.0E-5 , 1.0E+2 )
sigmaBeta ~ dgamma(1.1051,0.1051) # mode 1.0, sd 10.0
# sigmaBeta <- 1/sqrt(tauBeta) ; tauBeta ~ dgamma(0.001,0.001)
nu ~ dexp(1/30.0)
# Transform to original scale:
beta[1:Nx] <- (delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx])*ysd
beta0 <- zbeta0*ysd + ym - sum(delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx])*ysd
sigma <- zsigma*ysd
}
"
<- c("beta0", "beta", "sigma", "delta", "sigmaBeta", "zbeta0", "zbeta", "zsigma", "nu" )
parameters <- "parallel" # change to "rjags" in case of working on 1-core CPU
runjagsMethod # run JAGS
<- run.jags(method = runjagsMethod,
runJagsOut model = modelString,
monitor = parameters,
data = dataList,
n.chains = 3, #nChains <- 3
adapt = 500, #adaptSteps <- 500
burnin = 1000, #burnInSteps <- 1000
sample = ceiling(15000/3), #numSavedSteps <- 15000; nChains <- 3
thin = 25, #thinSteps <- 25
summarise = FALSE,
plots = FALSE)
# generate summary info for all parameters
summary(runJagsOut)
Calculating summary statistics...
Note: The monitored variable 'delta[2]' appears to be
non-stochastic; it will not be included in the convergence
diagnostic
Calculating the Gelman-Rubin statistic for 18 variables....
Error : The "modeest" package is required to calculate the mode of continuous variables
Lower95 Median Upper95 Mean
beta0 944.66400000 1.011275e+03 1.11231e+03 1.018969e+03
beta[1] -0.00128211 7.803640e+00 1.91766e+01 7.314723e+00
beta[2] -3.26471000 -2.761835e+00 -2.22909e+00 -2.753613e+00
beta[3] -5.57611000 0.000000e+00 4.47374e-03 -6.349370e-01
beta[4] -0.26275600 0.000000e+00 3.55510e+00 3.305153e-01
sigma 24.62630000 3.214530e+01 4.10201e+01 3.238811e+01
delta[1] 0.00000000 1.000000e+00 1.00000e+00 6.138000e-01
delta[2] 1.00000000 1.000000e+00 1.00000e+00 1.000000e+00
delta[3] 0.00000000 0.000000e+00 1.00000e+00 1.732667e-01
delta[4] 0.00000000 0.000000e+00 1.00000e+00 2.208667e-01
sigmaBeta 0.02406400 1.127925e+00 7.94953e+00 2.198454e+00
zbeta0 -0.12437500 -1.292225e-04 1.29719e-01 1.964852e-04
zbeta[1] -7.77130000 2.129285e-01 1.10354e+01 5.264458e-01
zbeta[2] -1.16775000 -9.878755e-01 -7.97319e-01 -9.849344e-01
zbeta[3] -24.01480000 -8.063585e-02 2.36377e+01 -2.338195e-01
zbeta[4] -17.73010000 1.036410e-01 1.87959e+01 3.378339e-01
zsigma 0.32913900 4.296320e-01 5.48246e-01 4.328772e-01
nu 1.91471000 2.685290e+01 9.61985e+01 3.558561e+01
SD Mode MCerr MC%ofSD SSeff AC.250
beta0 43.89744488 NA 0.4821033688 1.1 8291 0.019090161
beta[1] 7.04248924 NA 0.1017558219 1.4 4790 0.064596992
beta[2] 0.27026073 NA 0.0033981576 1.3 6325 0.053310641
beta[3] 1.74666227 NA 0.0151369398 0.9 13315 -0.009957991
beta[4] 0.99368906 NA 0.0093047567 0.9 11405 0.010317963
sigma 4.16357566 NA 0.0387315148 0.9 11556 0.016315434
delta[1] 0.48689359 1 0.0078742141 1.6 3823 0.085242124
delta[2] 0.00000000 1 NA NA NA NA
delta[3] 0.37849026 0 0.0034208127 0.9 12242 -0.004375483
delta[4] 0.41484462 0 0.0038539951 0.9 11586 0.010031033
sigmaBeta 3.24829579 NA 0.0687169662 2.1 2235 0.115133738
zbeta0 0.06462281 NA 0.0005367156 0.8 14497 -0.007738634
zbeta[1] 5.53722794 NA 0.2424652404 4.4 522 0.502781355
zbeta[2] 0.09666902 NA 0.0012154802 1.3 6325 0.053310325
zbeta[3] 18.13696053 NA 0.5572712553 3.1 1059 0.319423413
zbeta[4] 14.61246772 NA 0.3817438159 2.6 1465 0.293091432
zsigma 0.05564748 NA 0.0005176588 0.9 11556 0.016315483
nu 30.03665106 NA 0.2442544957 0.8 15122 -0.003220931
psrf
beta0 1.0000092
beta[1] 1.0001584
beta[2] 1.0002489
beta[3] 1.0006953
beta[4] 1.0006186
sigma 1.0000140
delta[1] 1.0006266
delta[2] NA
delta[3] 1.0008028
delta[4] 1.0006483
sigmaBeta 1.0016839
zbeta0 1.0001875
zbeta[1] 1.0614080
zbeta[2] 1.0002489
zbeta[3] 1.0456877
zbeta[4] 1.0327248
zsigma 1.0000147
nu 0.9999383
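The Mode column above is NA (and the summary reports an error) because the modeest package is not installed; installing it lets summary() fill in modes for the continuous parameters.
# optional: install modeest so summary(runJagsOut) can report the Mode column
if (!requireNamespace("modeest", quietly = TRUE)) install.packages("modeest")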
# plot all params
plot(runJagsOut,
plot.type = c("trace", "ecdf", "histogram", "autocorr"))
Generating summary statistics and plots (these will NOT be
saved for reuse)...
Calculating summary statistics...
Note: The monitored variable 'delta[2]' appears to be
non-stochastic; it will not be included in the convergence
diagnostic
Calculating the Gelman-Rubin statistic for 18 variables....
Error : The "modeest" package is required to calculate the mode of continuous variables
Find the posterior probability of different combinations of predictors by calculating the frequencies with which they appear in the MCMC sample.
trajectoriesDelta <- as.matrix(runJagsOut$mcmc[,7:10])
head(trajectoriesDelta)
delta[1] delta[2] delta[3] delta[4]
[1,] 0 1 0 0
[2,] 1 1 0 0
[3,] 1 1 0 0
[4,] 1 1 0 0
[5,] 1 1 1 0
[6,] 0 1 0 0
Nchain <- nrow(trajectoriesDelta)

(config1 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(1,1,0,0))))/Nchain)
[1] 0.4738667
(config2 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(1,0,0,0))))/Nchain)
[1] 0
(config3 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(0,1,0,0))))/Nchain)
[1] 0.2154667
(config4 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(1,1,1,1))))/Nchain)
[1] 0.02186667

(inclSpend <- sum(trajectoriesDelta[,1]==1)/Nchain)      # Spend
[1] 0.6138
(inclPrcntTake <- sum(trajectoriesDelta[,2]==1)/Nchain)  # PrcntTake
[1] 1
(inclStueTeaRat <- sum(trajectoriesDelta[,3]==1)/Nchain) # StuTeaRat
[1] 0.1732667
(inclSalary <- sum(trajectoriesDelta[,4]==1)/Nchain)     # Salary
[1] 0.2208667
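Instead of checking configurations one at a time, a compact alternative (a sketch, not in the original post; configLabels is a hypothetical name) tabulates every \(\delta\) pattern visited by the chains:
# sketch: posterior probability of every visited delta configuration in one table
configLabels <- apply(trajectoriesDelta, 1, paste, collapse = "")
sort(table(configLabels) / Nchain, decreasing = TRUE)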
The predictors with the highest marginal posterior inclusion probabilities are Spend (0.6138) and PrcntTake (1). Because PrcntTake is included in every saved step, its indicator \(\delta_2\) is non-stochastic and its standard deviation is zero (hence the note in the summary output above).

Warning: posterior probabilities of inclusion may be very sensitive to the prior information.
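One way to probe that sensitivity (a sketch, not part of the original post; modelStringSkeptical and runJagsOutSkeptical are hypothetical names) is to lower the prior inclusion probability from 0.5 to, say, 0.2 and rerun:
# sketch: rerun with a more skeptical prior inclusion probability (0.2 instead of 0.5)
modelStringSkeptical <- sub("dbern( 0.5 )", "dbern( 0.2 )", modelString, fixed = TRUE)
runJagsOutSkeptical <- run.jags(method = runjagsMethod, model = modelStringSkeptical,
                                monitor = parameters, data = dataList,
                                n.chains = 3, adapt = 500, burnin = 1000,
                                sample = ceiling(15000/3), thin = 25,
                                summarise = FALSE, plots = FALSE)
summary(runJagsOutSkeptical)  # compare delta[] inclusion probabilities with the 0.5 prior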